## The datatable function from the DT package creates an HTML widget display of the dataset
## Install the DT package if it is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |>
  DT::datatable()
Midterm Project Exercise: HR ANALYTICS EMPLOYEE ATTRITION AND PERFORMANCE
BCon 147: special topics
1 Project overview
In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.
2 Scenario
Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.
Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.
3 Understanding data source
The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.
This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.
4 Data wrangling and management
Libraries
Before we start working on the dataset, we need to load the necessary libraries that will be used for data wrangling, analysis and visualization. Make sure to load the following libraries here. For packages to be installed, you can use the install.packages function. There are packages to be installed later on this project, so make sure to install them as needed and load them here.
# load all your libraries here
library(readr)
library(dplyr)
library(DT)
library(janitor)
library(ggplot2)
library(plotly)
library(GGally)
library(stats)
library(sjPlot)
library(gridExtra)
library(report)
library(ggstatsplot)
library(scales)
library(tidyr)
4.1 Data importation
Import the two datasets, Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.
Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more about the left_join function in the dplyr documentation.
Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from the DT package.
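To make the effect of the join concrete, here is a minimal sketch on two made-up tables. The names employee_toy, rating_toy, and merged_toy and their values are hypothetical and are not part of the project data. It illustrates that left_join keeps every row of the left table and repeats an employee once per matching rating, so the merged table can have more rows than the employee table.

```r
library(dplyr)

# Hypothetical toy tables (not the project data)
employee_toy <- tibble(
  EmployeeID = c("A1", "A2", "A3"),
  Attrition  = c("No", "Yes", "No")
)
rating_toy <- tibble(
  EmployeeID = c("A1", "A1", "A2"),   # A1 has two ratings, A3 has none
  SelfRating = c(4, 5, 3)
)

# Keep all employees; attach every matching rating row
merged_toy <- left_join(employee_toy, rating_toy, by = "EmployeeID")

nrow(employee_toy)  # number of employees
nrow(merged_toy)    # one row per employee-rating pair, plus unmatched employees
merged_toy
```

If the real PerformanceRating.csv holds several reviews per employee, later counts are counts of review rows rather than of distinct employees, which is worth keeping in mind when reading the attrition rates.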
## import the two data here
employee_dta <- read_csv("C:/Users/1/Desktop/MY VSU/4TH YEAR/1st Semester/Special Topic/Midterm Project/midterm-bcon147-project-exercise-20241017T013024Z-001/midterm-bcon147-project-exercise/dataset/Employee.csv")
perf_rating_dta <- read_csv("C:/Users/1/Desktop/MY VSU/4TH YEAR/1st Semester/Special Topic/Midterm Project/midterm-bcon147-project-exercise-20241017T013024Z-001/midterm-bcon147-project-exercise/dataset/PerformanceRating.csv")
## merge employee_dta and perf_rating_dta using left_join function.
## save the merged dataset as hr_perf_dta
hr_perf_dta <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")
## Use the datatable from DT package to display the merged dataset
datatable(hr_perf_dta)
4.2 Data management
Using the clean_names function from the janitor package, standardize the variable names by using the recommended naming of variables.
Save the renamed variables as hr_perf_dta to update the dataset.
## clean names using the janitor packages and save as hr_perf_dta
hr_perf_dta <- hr_perf_dta |>
clean_names()
## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.
Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.
Create new variables cat_work_life_balance, cat_self_rating, and cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.
Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0 and Yes = 1.
Save all the changes in hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.
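Because three variables share the same satisfaction labels, one possible way to avoid repeating the case_when block is to wrap it in a small helper. The sketch below is only an illustration on made-up data; the function name recode_satisfaction and the toy tibble are hypothetical and are not part of the exercise.

```r
library(dplyr)

# Hypothetical helper: maps a 1-5 score to the satisfaction labels
recode_satisfaction <- function(x) {
  case_when(
    x == 1 ~ "Very dissatisfied",
    x == 2 ~ "Dissatisfied",
    x == 3 ~ "Neutral",
    x == 4 ~ "Satisfied",
    x == 5 ~ "Very satisfied",
    TRUE   ~ NA_character_   # anything outside 1-5 becomes NA
  )
}

# Toy data with one out-of-range value (9)
toy <- tibble(job_satisfaction = c(1, 3, 5, 9))

toy <- toy |>
  mutate(cat_job_sat = recode_satisfaction(job_satisfaction))

toy$cat_job_sat
```

The same helper could then be applied to environment_satisfaction and relationship_satisfaction inside a single mutate call.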
hr_perf_dta <- hr_perf_dta |>
## create cat_education
mutate(
cat_education = case_when(
education == 1 ~ "No formal education",
education == 2 ~ "High school",
education == 3 ~ "Bachelor",
education == 4 ~ "Masters",
education == 5 ~ "Doctorate",
TRUE ~ NA_character_ #ensures that any unrecognized values in the `education` column are assigned as `NA`
),
## create cat_envi_sat, cat_job_sat, and cat_relation_sat
cat_envi_sat = case_when(
environment_satisfaction == 1 ~ "Very dissatisfied",
environment_satisfaction == 2 ~ "Dissatisfied",
environment_satisfaction == 3 ~ "Neutral",
environment_satisfaction == 4 ~ "Satisfied",
environment_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_
),
cat_job_sat = case_when(
job_satisfaction == 1 ~ "Very dissatisfied",
job_satisfaction == 2 ~ "Dissatisfied",
job_satisfaction == 3 ~ "Neutral",
job_satisfaction == 4 ~ "Satisfied",
job_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_
),
cat_relation_sat = case_when(
relationship_satisfaction == 1 ~ "Very dissatisfied",
relationship_satisfaction == 2 ~ "Dissatisfied",
relationship_satisfaction == 3 ~ "Neutral",
relationship_satisfaction == 4 ~ "Satisfied",
relationship_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_
),
## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
cat_work_life_balance = case_when(
work_life_balance == 1 ~ "Unacceptable",
work_life_balance == 2 ~ "Needs improvement",
work_life_balance == 3 ~ "Meets expectation",
work_life_balance == 4 ~ "Exceeds expectation",
work_life_balance == 5 ~ "Above and beyond",
TRUE ~ NA_character_
),
cat_self_rating = case_when(
self_rating == 1 ~ "Unacceptable",
self_rating == 2 ~ "Needs improvement",
self_rating == 3 ~ "Meets expectation",
self_rating == 4 ~ "Exceeds expectation",
self_rating == 5 ~ "Above and beyond",
TRUE ~ NA_character_
),
cat_manager_rating = case_when(
manager_rating == 1 ~ "Unacceptable",
manager_rating == 2 ~ "Needs improvement",
manager_rating == 3 ~ "Meets expectation",
manager_rating == 4 ~ "Exceeds expectation",
manager_rating == 5 ~ "Above and beyond",
TRUE ~ NA_character_
),
## create bi_attrition
bi_attrition = case_when(
attrition == "No" ~ 0,
attrition == "Yes" ~ 1,
TRUE ~ NA_real_ # indicating a missing value (NA) that is of numeric type
)
)
## print the updated hr_perf_dta using datatable function
datatable(hr_perf_dta)
5 Exploratory data analysis
5.1 Descriptive statistics of employee attrition
Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.
Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate, group the dataset by job role. Afterward, you can use the count function to get the frequency of attrition for each job role and then divide it by the total number of observations in that job role. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.
The plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
## selecting attrition key variables and save as `attrition_key_var_dta`
attrition_key_var_dta <- hr_perf_dta |>
select(attrition, job_role, department, age, salary, job_satisfaction, cat_job_sat, work_life_balance, cat_work_life_balance)
## compute the attrition rate across job_role and save as attrition_rate_job_role
attrition_rate_job_role <- attrition_key_var_dta |>
group_by(job_role) |>
count(attrition) |>
mutate(pct_attrition = n / sum(n)) |>
ungroup()
##Christine's added comment: Filter only attrition cases (attrition == "Yes")
attrition_rate_job_role <- attrition_rate_job_role |>
filter(attrition == "Yes")
## print attrition_rate_job_role
attrition_rate_job_role
# A tibble: 11 × 4
   job_role                  attrition     n pct_attrition
   <chr>                     <chr>     <int>         <dbl>
 1 Analytics Manager         Yes          28        0.131
 2 Data Scientist            Yes         597        0.430
 3 Engineering Manager       Yes          18        0.0586
 4 HR Executive              Yes          29        0.244
 5 Machine Learning Engineer Yes          95        0.163
 6 Manager                   Yes          19        0.131
 7 Recruiter                 Yes          86        0.566
 8 Sales Executive           Yes         543        0.347
 9 Sales Representative      Yes         317        0.634
10 Senior Software Engineer  Yes          84        0.164
11 Software Engineer         Yes         445        0.324
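To see why n / sum(n) gives a within-role rate, here is a small sketch on made-up data (the toy tibble and its values are hypothetical). After group_by(job_role) and count(attrition), the result is still grouped by job_role, so sum(n) is the total for that role only.

```r
library(dplyr)

# Hypothetical toy data: 3 recruiters (2 left), 2 managers (1 left)
toy <- tibble(
  job_role  = c("Recruiter", "Recruiter", "Recruiter", "Manager", "Manager"),
  attrition = c("Yes", "Yes", "No", "No", "Yes")
)

toy_rate <- toy |>
  group_by(job_role) |>
  count(attrition) |>                      # n = rows per role and attrition value
  mutate(pct_attrition = n / sum(n)) |>    # sum(n) is taken within each job_role
  ungroup() |>
  filter(attrition == "Yes")               # keep only the share that left

toy_rate
```

In this toy case the Recruiter rate is 2 out of 3 and the Manager rate is 1 out of 2, matching a hand count.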
## Plot the attrition rate
p <- ggplot(attrition_rate_job_role, aes(x = reorder(job_role, pct_attrition), y = pct_attrition)) + #By default, R orders categorical variables alphabetically which does not convey any meaningful information about attrition rates. Reorder function was utilized to change the order of the job_role categories based on their associated pct_attrition values, from lowest to highest attrition.
geom_col(fill = "#70945f", color = "black", width = 0.7, size = 0.15) +
labs(title = "Attrition Rate by Job Role",
x = "Job Role",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # for readability in the variable names placed in the x-axis
panel.grid = element_line(color = "grey", size = 0.5),
plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) #accuracy within percent_format is used to round to the nearest whole number
#custom tooltip with percentage of attrition
p <- p + aes(text = paste("Attrition Rate:",
"<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(p, tooltip = "text")
## Christine's added code chunk
## compute the attrition rate across department
attrition_rate_department <- attrition_key_var_dta |>
group_by(department) |>
count(attrition) |>
mutate(pct_attrition = n / sum(n)) |>
ungroup()
##Filter only attrition cases (attrition == "Yes")
attrition_rate_department <- attrition_rate_department |>
filter(attrition == "Yes")
## Plot the attrition rate
q <- ggplot(attrition_rate_department, aes(x = reorder(department, pct_attrition), y = pct_attrition)) +
geom_col(fill = "#2c494a", color = "black", width = 0.7, size = 0.15) +
labs(title = "Attrition Rate by Department",
x = "Department",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(panel.grid = element_line(color = "grey", size = 0.5),
plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))
#custom tooltip with percentage of attrition
q <- q + aes(text = paste("Attrition Rate:",
"<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(q, tooltip = "text")
## Christine's added code chunk
##Check first the min and max value
summary(attrition_key_var_dta$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    25.0    28.0    30.6    36.0    51.0
##Check the data type of age variable
str(attrition_key_var_dta)
tibble [6,899 × 9] (S3: tbl_df/tbl/data.frame)
$ attrition : chr [1:6899] "No" "No" "No" "No" ...
$ job_role : chr [1:6899] "Sales Executive" "Sales Executive" "Sales Executive" "Sales Executive" ...
$ department : chr [1:6899] "Sales" "Sales" "Sales" "Sales" ...
$ age : num [1:6899] 30 30 30 30 30 30 30 30 30 38 ...
$ salary : num [1:6899] 102059 102059 102059 102059 102059 ...
$ job_satisfaction : num [1:6899] 3 4 5 3 4 2 5 2 5 3 ...
$ cat_job_sat : chr [1:6899] "Neutral" "Satisfied" "Very satisfied" "Neutral" ...
$ work_life_balance : num [1:6899] 4 2 4 3 3 3 4 2 5 5 ...
$ cat_work_life_balance: chr [1:6899] "Exceeds expectation" "Needs improvement" "Exceeds expectation" "Meets expectation" ...
# Drop rows with a missing age before grouping (a safeguard against data loss I personally ran into)
attrition_key_var_dta <- attrition_key_var_dta |>
  filter(!is.na(age))
## Defensive conversion: str() shows age is already numeric, so this is a no-op here,
## but it guards against age being read in as character or factor
attrition_key_var_dta <- attrition_key_var_dta |>
  mutate(age = as.numeric(as.character(age)))
# Create age groups
attrition_key_var_dta <- attrition_key_var_dta |>
mutate(age_group = cut(age,
breaks = c(18, 23, 28, 33, 38, 43, 48, 53),
labels = c("18-23", "23-28", "28-33", "33-38", "38-43", "43-48", "48+"),
right = FALSE)) # to exclude upper bound
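As a quick check of how cut assigns boundary ages when right = FALSE, here is a small sketch on made-up ages. The ages vector is hypothetical; the breaks and labels are the ones used above. With right = FALSE each interval is closed on the left and open on the right, so an age of 23 falls in the group labelled "23-28", not "18-23".

```r
# Hypothetical ages chosen to sit on or near the bin boundaries
ages <- c(18, 22, 23, 27, 28, 51)

age_group <- cut(ages,
                 breaks = c(18, 23, 28, 33, 38, 43, 48, 53),
                 labels = c("18-23", "23-28", "28-33", "33-38", "38-43", "43-48", "48+"),
                 right = FALSE)   # intervals are [18, 23), [23, 28), ...

data.frame(age = ages, age_group = age_group)
```

Since the labels share their end points, a reader may assume 23 belongs to both of the first two groups; labels such as "18-22" and "23-27" would be one way to make the bins unambiguous.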
## compute the attrition rate across age
attrition_rate_age <- attrition_key_var_dta |>
group_by(age_group) |>
count(attrition) |>
mutate(pct_attrition = n / sum(n)) |>
ungroup()
##Filter only attrition cases (attrition == "Yes")
attrition_rate_age <- attrition_rate_age |>
filter(attrition == "Yes")
## Plot the attrition rate
r <- ggplot(attrition_rate_age, aes(x = age_group, y = pct_attrition)) +
geom_col(fill = "#ff84c3", color = "black", width = 0.7, size = 0.15) +
labs(title = "Attrition Rate by Age",
x = "Age",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_line(color = "grey", size = 0.5),
plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))
#custom tooltip with percentage of attrition
r <- r + aes(text = paste("Attrition Rate:",
"<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(r, tooltip = "text")
## Christine's added code chunk
##Check first the min and max value
summary(attrition_key_var_dta$salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20387   44646   74458  110898  137220  547204
# Set minimum and maximum salary values
min_salary <- 20387
max_salary <- 547204
# Define bin width (choose based on your analysis goals)
bin_width <- 50000
# Calculate breaks based on the bin width
breaks <- seq(min_salary, max_salary + bin_width, by = bin_width)
# Create salary groups
attrition_key_var_dta <- attrition_key_var_dta |>
mutate(salary_group = cut(salary,
breaks = breaks,
labels = paste(head(breaks, -1), tail(breaks, -1), sep = "-"),
right = FALSE))
## compute the attrition rate across salary
attrition_rate_salary <- attrition_key_var_dta |>
group_by(salary_group) |>
count(attrition) |>
mutate(pct_attrition = n / sum(n)) |>
ungroup()
##Filter only attrition cases (attrition == "Yes")
attrition_rate_salary <- attrition_rate_salary |>
filter(attrition == "Yes")
## Plot the attrition rate
s <- ggplot(attrition_rate_salary, aes(x = salary_group, y = pct_attrition)) +
geom_col(fill = "#9a6c57", color = "black", width = 0.7, size = 0.15) +
labs(title = "Attrition Rate by Salary",
x = "Salary",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_line(color = "grey", size = 0.5),
plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))
#custom tooltip with percentage of attrition
s <- s + aes(text = paste("Attrition Rate:",
"<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(s, tooltip = "text")
## Christine's added code chunk
## compute the attrition rate across job satisfaction using cat_job_sat
attrition_rate_job_satisfaction <- attrition_key_var_dta |>
group_by(cat_job_sat) |>
count(attrition) |>
mutate(pct_attrition = n / sum(n)) |>
ungroup()
##Filter only attrition cases (attrition == "Yes")
attrition_rate_job_satisfaction <- attrition_rate_job_satisfaction |>
filter(attrition == "Yes")
#To reflect the right sequence in the graph
attrition_rate_job_satisfaction <- attrition_rate_job_satisfaction |>
mutate(cat_job_sat = factor(cat_job_sat,
levels = c("Very dissatisfied",
"Dissatisfied",
"Neutral",
"Satisfied",
"Very satisfied")))
## Plot the attrition rate
t <- ggplot(attrition_rate_job_satisfaction, aes(x = cat_job_sat, y = pct_attrition)) +
geom_col(fill = "#a72219", color = "black", width = 0.7, size = 0.15) +
labs(title = "Attrition Rate by Job Satisfaction",
x = "Job Satisfaction",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_line(color = "grey", size = 0.5),
plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))
#custom tooltip with percentage of attrition
t <- t + aes(text = paste("Attrition Rate:",
"<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(t, tooltip = "text")
## Christine's added code chunk
## compute the attrition rate across work-life balance using cat_work_life_balance
attrition_rate_work_life_balance <- attrition_key_var_dta |>
group_by(cat_work_life_balance) |>
count(attrition) |>
mutate(pct_attrition = n / sum(n)) |>
ungroup()
##Filter only attrition cases (attrition == "Yes")
attrition_rate_work_life_balance <- attrition_rate_work_life_balance |>
filter(attrition == "Yes")
#To reflect the right sequence in the graph
attrition_rate_work_life_balance <- attrition_rate_work_life_balance |>
mutate(cat_work_life_balance = factor(cat_work_life_balance,
levels = c("Unacceptable",
"Needs improvement",
"Meets expectation",
"Exceeds expectation",
"Above and beyond")))
## Plot the attrition rate
u <- ggplot(attrition_rate_work_life_balance, aes(x = cat_work_life_balance, y = pct_attrition)) +
geom_col(fill = "#ffe366", color = "black", width = 0.7, size = 0.15) +
labs(title = "Attrition Rate by Work-Life Balance",
x = "Work-Life Balance",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_line(color = "grey", size = 0.5),
plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))
#custom tooltip with percentage of attrition
u <- u + aes(text = paste("Attrition Rate:",
"<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(u, tooltip = "text")
5.2 Identifying attrition key drivers using correlation analysis
Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using na.omit() before running the correlation analysis. Save the output in hr_corr.
Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and its ggcorr function to visualize the correlation heatmap. You may explore the ggcorr documentation for more information.
Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.
## Christine's added code chunk
## Remove missing values
hr_perf_dta_no_NA <- na.omit(hr_perf_dta)
## conduct correlation of key variables.
hr_corr <- cor(hr_perf_dta_no_NA[, c("bi_attrition", "salary", "years_at_company", "job_satisfaction", "manager_rating", "work_life_balance")])
## print hr_corr
hr_corr
                  bi_attrition       salary years_at_company job_satisfaction
bi_attrition       1.000000000 -0.211181478    -0.6896527798     0.0132368129
salary            -0.211181478  1.000000000     0.2206442116     0.0053054850
years_at_company  -0.689652780  0.220644212     1.0000000000     0.0008700583
job_satisfaction   0.013236813  0.005305485     0.0008700583     1.0000000000
manager_rating    -0.007654429 -0.001596736     0.0178656879    -0.0158205481
work_life_balance  0.003428836 -0.001517145     0.0079339508     0.0417242942
                  manager_rating work_life_balance
bi_attrition        -0.007654429       0.003428836
salary              -0.001596736      -0.001517145
years_at_company     0.017865688       0.007933951
job_satisfaction    -0.015820548       0.041724294
manager_rating       1.000000000       0.007996938
work_life_balance    0.007996938       1.000000000
# Create a correlation heatmap
ggcorr(hr_perf_dta_no_NA[, c("bi_attrition", "salary", "years_at_company", "job_satisfaction", "manager_rating", "work_life_balance")],
label = TRUE,
label_size = 3,
label_color = "black",
hjust = 0.75,
       low = "#ff3f31", high = "#149127", mid = "#ffeb38", # low, mid, and high set the colors for negative, zero, and positive correlations
digits = 2) +
labs(title = "Correlation Heatmap of Selected Variables",
subtitle = "Analyzing Relationships Between Key Factors and Attrition") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), # Centered title
        plot.subtitle = element_text(hjust = 0.5, size = 12))
## install GGally package and use ggcorr function to visualize the correlation
install.packages("GGally")
The variable most strongly correlated with attrition is years_at_company, with a correlation coefficient of about -0.69. The negative relationship suggests that the longer employees stay with the company, the less likely they are to leave, possibly because they have developed stronger loyalty and commitment to the company and enjoy greater job security. Salary shows a weaker negative correlation (about -0.21), while job_satisfaction, manager_rating, and work_life_balance are essentially uncorrelated with attrition (all below 0.02 in absolute value).
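One way to read the bi_attrition row of the correlation matrix is to rank the variables by the absolute size of their correlation. The sketch below copies the coefficients from the hr_corr output (rounded to four decimals) into a named vector. The object names attrition_corr and ranked are hypothetical, and this is an illustration rather than part of the required output.

```r
# Correlations with bi_attrition, copied from the hr_corr output (rounded)
attrition_corr <- c(
  salary            = -0.2112,
  years_at_company  = -0.6897,
  job_satisfaction  =  0.0132,
  manager_rating    = -0.0077,
  work_life_balance =  0.0034
)

# Rank by strength of association, ignoring the sign
ranked <- attrition_corr[order(abs(attrition_corr), decreasing = TRUE)]
ranked

# Squared correlation: share of variance in bi_attrition linearly associated with each variable
round(ranked^2, 2)
```

On these figures years_at_company comes first and salary second, with the three rating variables close to zero. Correlation alone does not show that tenure causes retention; it only describes the linear association.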
5.3 Predictive modeling for attrition
Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.
Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the sjPlot documentation on how to customize your model summary.
Also, use the plot_model function to visualize the model coefficients. You may read the sjPlot documentation on how to customize your model visualization.
Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.
## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction + manager_rating + work_life_balance,
data = hr_perf_dta_no_NA,
family = binomial) #specifies the model to be logistic regression with binary dependent variable
## print the summary of the model using the summary function
summary(hr_attrition_glm_model)
Call:
glm(formula = bi_attrition ~ salary + years_at_company + job_satisfaction +
manager_rating + work_life_balance, family = binomial, data = hr_perf_dta_no_NA)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.571e+00 2.173e-01 11.831 <2e-16 ***
salary -3.633e-06 4.086e-07 -8.893 <2e-16 ***
years_at_company -6.333e-01 1.476e-02 -42.919 <2e-16 ***
job_satisfaction 3.470e-02 3.186e-02 1.089 0.276
manager_rating 5.071e-03 3.810e-02 0.133 0.894
work_life_balance 2.587e-02 3.198e-02 0.809 0.419
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8574.5 on 6708 degrees of freedom
Residual deviance: 4781.6 on 6703 degrees of freedom
AIC: 4793.6
Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model
install.packages("sjPlot")
# Display the logistic regression model summary in a tabular format
tab_model(hr_attrition_glm_model,
title = "<strong style='text-align:center;'>Logistic Regression Model Predicting Employee Attrition</strong>",
show.ci = TRUE,
show.se = TRUE,
          show.stat = TRUE)
| bi attrition | |||||
| Predictors | Odds Ratios | std. Error | CI | Statistic | p |
| (Intercept) | 13.08 | 2.84 | 0.00 – Inf | 11.83 | <0.001 |
| salary | 1.00 | 0.00 | 0.00 – Inf | -8.89 | <0.001 |
| years at company | 0.53 | 0.01 | 0.00 – Inf | -42.92 | <0.001 |
| job satisfaction | 1.04 | 0.03 | 0.00 – Inf | 1.09 | 0.276 |
| manager rating | 1.01 | 0.04 | 0.00 – Inf | 0.13 | 0.894 |
| work life balance | 1.03 | 0.03 | 0.00 – Inf | 0.81 | 0.419 |
| Observations | 6709 | ||||
| R2 Tjur | 0.502 | ||||
## use plot_model function to visualize the model coefficients
z <- plot_model(hr_attrition_glm_model,
type = "est",
show.values = TRUE,
value.offset = 0.3,
title = "Model Coefficients - Employee Attrition",
ci.lvl = 0.95,
colors = "Set1",
axis.title = c("Variables", "Coefficient Estimate"),
value.size = 4,
axis.labels = c("Work-Life Balance", "Manager Rating", "Job Satisfaction", "Years at Company", "Salary"),
grid = TRUE,
theme = theme_bw())
## plot_model uses some default plot settings and it might interfere with theme() customization if integrated with the above code
z <- z + theme_bw() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
## Add legend to distinguish colors
legend <- grid::textGrob("Red = Negative Coefficients\nBlue = Positive Coefficients",
gp = grid::gpar(col = "black", fontsize = 9.5, fontface = "italic"),
just = "left",
hjust = 0,
x = unit(0.05, "npc")) # Force padding for left alignment
# Combine the plot and the legend in a layout
grid.arrange(z, legend, nrow = 2, heights = c(5, 0.5))
## Christine's concern: when this code chunk is run by itself, the visual does not show up in the viewer (or does it take longer to load?). When rendered, it is there.
Based on the p-values, only salary and years_at_company are statistically significant at the 1% level of significance, together with the intercept. Both variables also have negative coefficients, which indicates that a higher salary and longer tenure are associated with a lower likelihood of employees leaving. Job satisfaction, manager rating, and work-life balance are not statistically significant in this model.
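The odds ratios in the tab_model table are the exponentiated coefficients from the summary output. As a small worked check, the sketch below copies the estimates from that summary into a named vector (the object names log_odds and odds_ratios are hypothetical) and converts them by hand.

```r
# Estimates (log-odds) copied from summary(hr_attrition_glm_model)
log_odds <- c(
  intercept         =  2.571,
  salary            = -3.633e-06,
  years_at_company  = -0.6333,
  job_satisfaction  =  0.03470,
  manager_rating    =  0.005071,
  work_life_balance =  0.02587
)

# Odds ratio = exp(coefficient)
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)

# The salary coefficient is per one unit of salary, so its odds ratio rounds to 1.00.
# Rescaling to a 10,000 increase makes the effect easier to read.
salary_or_per_10k <- exp(log_odds[["salary"]] * 10000)
round(salary_or_per_10k, 2)
```

Read this way, each additional year at the company multiplies the odds of leaving by roughly 0.53, holding the other variables constant, and a salary that is higher by 10,000 multiplies them by roughly 0.96.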
5.4 Analysis of compensation and turnover
Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.
Install the report package and use the report function to generate a report of the t-test results.
Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.
Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.
Provide recommendations on whether revising compensation policies could be an effective retention strategy.
## compare the average monthly income of employees who left and those who stayed
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta_no_NA)
## print the results of the t-test
print(attrition_ttest_results)
Welch Two Sample t-test
data: salary by bi_attrition
t = 19.074, df = 5557.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
39387.67 48411.52
sample estimates:
mean in group 0 mean in group 1
125856.35 81956.76
## install the report package and use the report function to generate a report of the t-test results
install.packages("report")
# Generate a report of the t-test results
attrition_report <- report(attrition_ttest_results)
print(attrition_report)
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.26e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43899.59, 95% CI [39387.67, 48411.52], t(5557.53) = 19.07, p < .001; Cohen's d
= 0.51, 95% CI [0.46, 0.57])
# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed
install.packages("ggstatsplot")
ggbetweenstats(data = hr_perf_dta_no_NA,
x = bi_attrition,
y = salary,
xlab = "Attrition (0 = Stayed, 1 = Left)",
ylab = "Monthly Income",
               title = "Monthly Income Distribution: Employees Who Left vs. Stayed")
# create histogram and frequency polygon of salary for employees who left and those who stayed
ggplot(hr_perf_dta_no_NA, aes(x = salary, fill = as.factor(bi_attrition))) +
geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  geom_freqpoly(bins = 30, color = "black") + # drawn on the count scale so the polygon overlays the histogram and matches the "Count" axis label
facet_wrap(~ bi_attrition, scales = "free") +
labs(title = "Salary Distribution: Employees Who Left vs. Stayed",
x = "Salary", y = "Count", fill = "Attrition Status") +
scale_fill_manual(values = c("0" = "#f25a0f", "1" = "#485e5b")) +
theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13))
Determine the Significant Difference:
Having run the t-test comparing the average monthly income of employees who left the company with that of employees who stayed, the resulting p-value (p < 2.2e-16) is far below the common alpha level of 5%, which indicates a statistically significant difference in average salaries between the two groups. It can also be discerned that the employees who stayed have a higher average salary (about 125,856) than those who left (about 81,957), a difference of about 43,900.
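For readers who want to see what t.test computes by default, here is a minimal sketch on made-up salaries. The vectors stayed and left and their values are hypothetical and are not the project data. It compares the statistic reported by t.test, which uses the Welch version unless var.equal = TRUE is set, with the Welch formula written out by hand.

```r
# Hypothetical salaries (in thousands) for two small groups
stayed <- c(120, 130, 110, 140, 125)
left   <- c(80, 95, 70, 85)

# t.test uses the Welch (unequal variances) version by default
res <- t.test(stayed, left)

# Welch t statistic by hand: difference in means over its standard error
se_diff  <- sqrt(var(stayed) / length(stayed) + var(left) / length(left))
manual_t <- (mean(stayed) - mean(left)) / se_diff

unname(res$statistic)  # statistic reported by t.test
manual_t               # same value from the formula
```

The same structure underlies the result above: the difference in group means divided by a standard error built from each group's variance and size.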
Recommendations on revising compensation policies as an effective retention strategy:
Given the substantial difference in salaries between employees who stayed and those who left, revising the compensation policies could be an effective employee retention strategy. For instance, adjusting salaries, especially for those at risk of leaving, may enhance retention and boost overall job satisfaction. Introducing performance-based incentives also creates a mutually beneficial arrangement that aligns employee goals with organizational success. Another strategy is offering a comprehensive benefits package, which can include health insurance and retirement plans, to improve employee satisfaction and loyalty. Lastly, the company can invest in training programs and career development initiatives so that employees see opportunities for growth within the organization, reducing the likelihood of them leaving the company.
5.5 Employee satisfaction and performance analysis
Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.
Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.
Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.
Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.
Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.
# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.
average_ratings <- hr_perf_dta_no_NA %>%
  group_by(bi_attrition) %>%
  summarise(
    Avg_SelfRating = mean(self_rating, na.rm = TRUE),
    Avg_ManagerRating = mean(manager_rating, na.rm = TRUE),
    Count = n() # Count of employees in each group
  )
print(average_ratings)

# A tibble: 2 × 4
  bi_attrition Avg_SelfRating Avg_ManagerRating Count
         <dbl>          <dbl>             <dbl> <int>
1            0           3.98              3.48  4448
2            1           3.99              3.46  2261
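The bar plots in this section pad the counts with rating categories that never occur in the data. When that padding is done with a left_join() against a list of categories, a padded row has no attrition status, so bi_attrition is NA for it. A minimal sketch of one alternative, assuming dplyr and tidyr are available, is tidyr::complete(), which fills in every rating and status combination with a zero count. The toy data below is an assumption made for illustration.

```r
library(dplyr)
library(tidyr)

rating_levels <- c("Unacceptable", "Needs improvement", "Meets expectation",
                   "Exceeds expectation", "Above and beyond")

# Toy data (assumption): nobody received the two lowest ratings
toy <- data.frame(
  cat_self_rating = c("Meets expectation", "Meets expectation",
                      "Exceeds expectation", "Above and beyond",
                      "Above and beyond"),
  bi_attrition = c(0, 1, 0, 0, 1)
)

padded_counts <- toy |>
  mutate(
    cat_self_rating = factor(cat_self_rating, levels = rating_levels),
    bi_attrition = factor(bi_attrition, levels = c(0, 1),
                          labels = c("Stayed", "Left"))
  ) |>
  count(cat_self_rating, bi_attrition, name = "count") |>
  # complete() adds each missing rating/status pair with count = 0
  complete(cat_self_rating, bi_attrition, fill = list(count = 0L))

padded_counts
```

Because both columns are factors with all levels declared, the result has one row per rating and status pair, the x-axis keeps its intended order, and no NA group appears in the legend.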
# Count occurrences of each category in the original data
count_data <- hr_perf_dta_no_NA |>
  group_by(cat_self_rating, bi_attrition) |>
  summarise(count = n(), .groups = 'drop')
# All possible categories (with zero counts)
all_categories <- data.frame(cat_self_rating = c(
  "Unacceptable", "Needs improvement",
  "Meets expectation", "Exceeds expectation",
  "Above and beyond"))
# Join with all categories to include those with zero counts.
# Displaying categories with zero counts visually highlights gaps in the data
# collection or response patterns, which could indicate that those options were
# either not applicable or deemed irrelevant by respondents.
final_data <- all_categories |>
  left_join(count_data, by = "cat_self_rating") |>
  mutate(count = replace_na(count, 0),
         bi_attrition = factor(bi_attrition, levels = c(0, 1), labels = c("Stayed", "Left")))
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.
f <- ggplot(final_data, aes(x = cat_self_rating, y = count, fill = as.factor(bi_attrition),
                            text = paste("Count:", "<b>", count, "</b>"))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Distribution of Self Rating: Employees Who Left vs. Stayed",
       x = "Self Rating", y = "Count", fill = "Attrition Status") +
  scale_fill_manual(values = c("Stayed" = "#f25a0f", "Left" = "#485e5b")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(f, tooltip = "text")

# Count occurrences of each category in the original data
count_data_2 <- hr_perf_dta_no_NA |>
  group_by(cat_manager_rating, bi_attrition) |>
  summarise(count = n(), .groups = 'drop')
# All possible categories (with zero counts)
all_categories_2 <- data.frame(cat_manager_rating = c(
  "Unacceptable", "Needs improvement",
  "Meets expectation", "Exceeds expectation",
  "Above and beyond"))
# Join with all categories to include those with zero counts
final_data_2 <- all_categories_2 |>
  left_join(count_data_2, by = "cat_manager_rating") |>
  mutate(count = replace_na(count, 0),
         bi_attrition = factor(bi_attrition, levels = c(0, 1), labels = c("Stayed", "Left")))
# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.
j <- ggplot(final_data_2, aes(x = cat_manager_rating, y = count, fill = as.factor(bi_attrition),
                              text = paste("Count:", "<b>", count, "</b>"))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Distribution of Manager Rating: Employees Who Left vs. Stayed",
       x = "Manager Rating", y = "Count", fill = "Attrition Status") +
  scale_fill_manual(values = c("Stayed" = "#f25a0f", "Left" = "#485e5b")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(j, tooltip = "text")

# Convert variables to factors
hr_perf_dta_no_NA$cat_job_sat <- factor(
  hr_perf_dta_no_NA$cat_job_sat,
  levels = c("Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied")
)
hr_perf_dta_no_NA$bi_attrition <- as.factor(hr_perf_dta_no_NA$bi_attrition)
# Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.
ggplot(hr_perf_dta_no_NA, aes(x = salary, y = cat_job_sat, fill = bi_attrition)) +
  geom_boxplot() +
  labs(title = "Salary by Job Satisfaction and Attrition",
       x = "Salary", y = "Job Satisfaction", fill = "Attrition Status") +
  scale_fill_manual(values = c("0" = "#f25a0f", "1" = "#485e5b"),
                    labels = c("0" = "Stayed", "1" = "Left")) +
  scale_x_continuous(labels = comma) + # comma() is from the scales package; converts scientific notation to standard numbers
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"))
# NEEDS TO BE SCRUTINIZED AND ENHANCED

BASED ON AVERAGE PERFORMANCE RATINGS FOR EACH GROUP (Summary):
The average self-ratings of employees who stayed (3.98) and those who left (3.99) are nearly identical, suggesting that self-ratings alone may not be a good predictor of attrition. Average manager ratings are only marginally higher for those who stayed (3.48 vs. 3.46), so any relationship between manager perception and attrition appears weak and would need a formal test to confirm. More employees stayed (4,448) than left (2,261), but those who left still make up roughly a third of the total. That is a significant number, and it could be bad for the company’s image.
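To put the two group counts in perspective, the share of each group can be computed directly from the Count column of the summary table. This is a small arithmetic sketch that uses only those two numbers.

```r
# Group counts taken from the Count column of the summary table
counts <- c(Stayed = 4448, Left = 2261)

# Share of each group out of all 6,709 observations
shares <- prop.table(counts)
round(shares, 3)
# Stayed   Left
#  0.663  0.337
```

About one in three observations belongs to the group that left, which is worth keeping in mind when judging whether the attrition rate is low.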
BOXPLOT VISUALIZATION:
There is a clear trend of increasing salary with higher job satisfaction, as shown by the rightward shift of the salary distribution from “Very dissatisfied” to “Very satisfied.” Employees who left tend to have lower salaries than those who stayed, regardless of satisfaction, and their salaries are clustered within a narrower range. In contrast, salaries of those who stayed are more varied. Overall, the boxplot suggests a correlation between higher job satisfaction and higher salaries.
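A boxplot reading like the one above can be cross-checked with a numeric summary. The sketch below, which assumes dplyr is available, computes the median and interquartile range of salary for each satisfaction and attrition group. The data is simulated, so the numbers it produces are placeholders and say nothing about the HR dataset; only the column names follow the plotting code.

```r
library(dplyr)

set.seed(147)
sat_levels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very satisfied")

# Simulated data (assumption): 400 employees with random satisfaction,
# attrition status, and log-normally distributed salaries
toy <- data.frame(
  cat_job_sat = factor(sample(sat_levels, 400, replace = TRUE),
                       levels = sat_levels),
  bi_attrition = factor(sample(c(0, 1), 400, replace = TRUE),
                        levels = c(0, 1), labels = c("Stayed", "Left")),
  salary = round(rlnorm(400, meanlog = 11.3, sdlog = 0.5))
)

# Median locates the box's center line; IQR measures the box's width
salary_summary <- toy |>
  group_by(cat_job_sat, bi_attrition) |>
  summarise(
    n = n(),
    median_salary = median(salary),
    iqr_salary = IQR(salary),
    .groups = "drop"
  )

salary_summary
```

On the real data, a rising median_salary across satisfaction levels and a smaller iqr_salary for the group that left would back up the two visual claims with numbers.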
RECOMMENDATION:
The visuals suggest that salary may be a significant factor in employee retention. Before adjusting the salary structure, it is important to make sure that pay levels remain competitive within the industry and for the role, while taking into account the company’s financial resources and its ability to afford the proposed adjustments. The company could implement a system that fairly compensates employees based on their contribution and performance, so that employees feel more appreciated and valued and are ultimately less likely to leave. The company can also review other factors that affect employee retention, such as training opportunities, flexible work arrangements, work-life balance, and company culture.
5.6 Work-life balance and retention strategies
At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:
Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.
Use visualizations to show the differences.
Assess whether employees with poor work-life balance are more likely to leave.
You are free to decide how to accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.
ggplot(hr_perf_dta_no_NA, aes(x = cat_work_life_balance, fill = as.factor(bi_attrition))) +
  geom_bar(position = "dodge", color = "black", size = 0.05) +
  labs(title = "Work-Life Balance Ratings by Employee Status",
       x = "Work-Life Balance Rating",
       fill = "Employee Status") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("0" = "#f25a0f", "1" = "#485e5b"),
                    breaks = c("0", "1"),
                    labels = c("Stayed", "Left")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

Satisfied employees, those with a higher work-life balance rating, are more likely to remain in the business. For HR, this means strengthening the existing policies that promote a healthy work-life balance while also improving it through tailored support, such as time-off incentives, for employees whose rating is “needs improvement”. HR could also conduct interviews or surveys to better understand employees’ concerns about their work-life balance.
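A bar chart of counts shows how many employees fall into each group, but the question of whether poor work-life balance makes leaving more likely is about rates. The following is a minimal sketch of how the attrition rate per rating and a chi-square test of independence could be computed in base R. The data is simulated and the rating labels are assumptions, so the output does not describe the HR dataset.

```r
set.seed(147)

# Rating labels are an assumption; substitute the dataset's own categories
wlb_levels <- c("Unacceptable", "Needs improvement", "Meets expectation",
                "Exceeds expectation", "Above and beyond")

# Simulated data (assumption): 500 employees, about 30% of whom left
toy <- data.frame(
  cat_work_life_balance = factor(sample(wlb_levels, 500, replace = TRUE),
                                 levels = wlb_levels),
  bi_attrition = rbinom(500, size = 1, prob = 0.3)
)

# Cross-tabulate rating by attrition status (0 = stayed, 1 = left)
tab <- table(toy$cat_work_life_balance, toy$bi_attrition,
             dnn = c("wlb", "attrition"))

# Row-wise proportions: within each rating, the share who stayed and left
rates <- prop.table(tab, margin = 1)
round(rates, 3)

# Chi-square test of independence between rating and attrition
chi <- chisq.test(tab)
chi
```

On the real data, a noticeably higher "1" share in the lower ratings together with a small p-value would support the claim that poor work-life balance goes with a higher chance of leaving.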
5.7 Recommendations for HR interventions
Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussions.
Question: What are the key factors contributing to employee attrition in the company?
Answer:
By just looking at the graphs, we can pinpoint potential contributors to attrition, but the graphs alone do not provide conclusive evidence. Based only on the graphs, the key factors are:
- job role
- department
- age
- salary
- manager rating
- work-life balance
The attrition rate is not uniform across the values of these variables: the bars vary in height along the x-axis, which suggests that these variables may be contributing factors.
In the correlation test, ‘years at company’ shows the strongest correlation with attrition among all the variables, but it does not meet the -/+ 0.8 threshold, so it is not sufficient on its own. Moreover, correlation does not mean causation.
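As an illustration of how such a check works, the sketch below computes the Pearson correlation between a binary attrition flag and tenure (the point-biserial correlation) and compares its size with the 0.8 threshold. The data is simulated, and the column name years_at_company is a hypothetical stand-in for the dataset's own variable.

```r
set.seed(147)
n <- 400

# Simulated tenure in years (assumption)
years_at_company <- sample(0:10, n, replace = TRUE)

# Simulated attrition flag: the chance of leaving falls as tenure grows
bi_attrition <- rbinom(n, size = 1,
                       prob = plogis(0.5 - 0.3 * years_at_company))

# Pearson correlation with a 0/1 variable is the point-biserial correlation
ct <- cor.test(years_at_company, bi_attrition)

ct$estimate                 # correlation coefficient r
ct$p.value                  # test of H0: r = 0
abs(ct$estimate) >= 0.8     # does |r| reach the "strong" threshold?
```

A coefficient can be statistically significant and still fall well short of a strong-correlation threshold, which is why the size of r and its p-value are worth reporting separately.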
Question: Which factors are most strongly correlated with attrition?
Answer:
In the correlation test, the variable ‘years at company’ shows the strongest correlation with attrition. If the concern is which variables have a significant effect on attrition, ‘salary’ and ‘years at company’ both show statistical significance, and their effect on attrition is negative.
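One common way to test whether variables such as salary and tenure have a significant effect on a binary outcome is logistic regression. This section does not show which model produced the significance results, so the following is only a sketch of that technique on simulated data; the column names and effect sizes are assumptions.

```r
set.seed(147)
n <- 500

# Simulated predictors (assumption)
toy <- data.frame(
  salary = round(rlnorm(n, meanlog = 11.3, sdlog = 0.5)),
  years_at_company = sample(0:10, n, replace = TRUE)
)
toy$salary_10k <- toy$salary / 10000   # rescale so the coefficient is readable

# Simulated outcome: higher pay and longer tenure lower the chance of leaving
toy$bi_attrition <- rbinom(
  n, size = 1,
  prob = plogis(1 - 0.1 * toy$salary_10k - 0.25 * toy$years_at_company)
)

# Logistic regression of attrition on salary and tenure
fit <- glm(bi_attrition ~ salary_10k + years_at_company,
           data = toy, family = binomial)

coefs <- summary(fit)$coefficients
coefs             # estimates, standard errors, z values, p-values
exp(coef(fit))    # odds ratios; values below 1 mean lower odds of leaving
```

A negative coefficient (an odds ratio below 1) means that higher values of the predictor go with lower odds of attrition, which is the sense in which an effect on attrition is described as negative.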
Question: What strategies could be implemented to improve employee retention and satisfaction?
Answer:
Salary-related strategies have a significant bearing on the organization’s attrition problem. Adopting industry-standard salaries ensures competitive compensation, which helps attract talent and reduces the likelihood that employees leave. Introducing performance-based pay is another strategy that can enhance job satisfaction, since it incentivizes employees to excel in their roles while also contributing to the organization’s success. Another approach is to provide a benefits package, which could include health insurance and bonuses, to improve employee satisfaction.
Work-life balance strategies can also be a significant supplement to the approaches that minimize employee attrition. Beyond strengthening the existing policies that promote a healthy work-life balance, the organization can provide additional tailored support, such as flexible work arrangements or encouragement to use vacation days and personal leave, so that employees feel cared for by the organization and are less likely to leave.
Question: How can HR leverage the insights from the analysis to develop effective retention strategies?
Answer:
To ensure competitive compensation that adheres to industry-standard salaries, the HR team can systematically gather and analyze market salary data. To implement performance-based pay, they can set clear performance metrics and roll out a performance-based pay system. And to tailor benefit packages to employees, the HR team can survey employees to get a clear picture of their preferences and needs.
To further promote work-life balance, the HR department can expand flexible work arrangements by giving employees the option to work remotely or on a compressed workweek, while making sure that the organization’s success is not compromised. To keep this approach effective, these policies must be reviewed regularly using employee feedback. Similarly, to encourage time off work, which creates a culture that values rest and rejuvenation, the organization can send reminders about its vacation policies.
Question: What are the potential benefits of implementing these strategies for the company?
Answer:
Implementing these strategies can really help reduce employee turnover and improve retention by making the company a desirable place to work. By offering competitive salaries and attractive benefits, the organization can attract top talent and build a positive workplace culture. Performance-based pay will motivate employees to perform at their best, boosting job satisfaction and engagement. Additionally, promoting work-life balance through flexible arrangements and encouraging employees to take their vacation time can prevent burnout and improve overall morale. In the end, these efforts will create a more cohesive team and support long-term success for the organization.